# TinyStories

This tutorial demonstrates how to use NeMo Curator's Python API to curate the [TinyStories](https://arxiv.org/abs/2305.07759) dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, using only words that 3- to 4-year-olds typically understand. The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine.

For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.

## Usage
After installing the NeMo Curator package, run the following command:
```
python tutorials/tinystories/main.py
```

This downloads the validation split of the TinyStories dataset and then runs the data curation pipeline on it.
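
To give a rough idea of what the download step involves, here is a minimal sketch using only the Python standard library. It is not the code in `main.py`. The URL, the `<|endoftext|>` story separator, and the helper names `download_if_missing` and `split_stories` are assumptions made for illustration, so check them against the script and the dataset before relying on them.

```python
import os
import urllib.request

# Assumed location of the validation split; not taken from this tutorial.
VALID_URL = (
    "https://huggingface.co/datasets/roneneldan/TinyStories/"
    "resolve/main/TinyStories-valid.txt"
)
# Assumed delimiter between stories in the raw text file.
STORY_SEPARATOR = "<|endoftext|>"


def download_if_missing(url: str, path: str) -> str:
    """Download `url` to `path` unless the file already exists; return `path`."""
    if not os.path.exists(path):
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path


def split_stories(raw_text: str, separator: str = STORY_SEPARATOR) -> list[str]:
    """Split the raw file contents into individual, non-empty stories."""
    return [story.strip() for story in raw_text.split(separator) if story.strip()]
```

A caller would typically pass `VALID_URL` and a local file path to `download_if_missing`, read the resulting file as UTF-8 text, and hand its contents to `split_stories` to get one string per story before any curation steps are applied.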
